Similarity Computation in Novelty Detection and Biomedical Text Categorization
Abstract
The novelty track was first introduced in TREC 2002. Given a TREC topic, the goal of the 2004 task is to locate relevant and new information in a set of documents. From the TREC 2002 and 2003 results, we learned that the main challenge in recognizing relevant sentences is the small amount of information available for computing similarity between sentences. This year, we used variants of a method that employs an information retrieval (IR) system to find relevant and novel sentences. The method, called IR with reference corpus, can also be viewed as a form of information expansion for sentences. Each sentence is treated as a query against a reference corpus, and the similarity between sentences is measured by the weighting vectors of the document lists ranked by the IR system. Relevant sentences are thus extracted by comparing their retrieval results on a given IR system: two sentences are regarded as similar if the document lists the IR system returns for them are similar. For the novelty part, we used a similar approach to extract novel sentences from the sentences identified as relevant. We also present an effective dynamic threshold-setting approach based on the percentage of relevant sentences within a relevant document. This paper focuses on three points: first, how to use the results of an IR system to compare the similarity between sentences; second, how to filter out redundant sentences; and third, how to determine appropriate relevance and novelty thresholds.
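The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy reference corpus, the TF-IDF-style scoring used as a stand-in IR system, the cosine measure, the fixed threshold, and all function names (`retrieve`, `sentence_similarity`, `filter_novel`) are assumptions made for illustration only. Each sentence is issued as a query, the scored document list becomes its weighting vector, and two sentences are compared through those vectors rather than through their own words.

```python
import math
from collections import Counter

# Hypothetical toy reference corpus; the abstract does not say which corpus was used.
REFERENCE_CORPUS = {
    "d1": "the court ruled on the merger of two airlines",
    "d2": "airlines merger approved by regulators after review",
    "d3": "new vaccine trial shows strong immune response",
    "d4": "regulators review vaccine safety data",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, kept deliberately simple)."""
    return text.lower().split()


def retrieve(sentence, corpus=REFERENCE_CORPUS):
    """Stand-in IR system: treat the sentence as a query and return
    {doc_id: score} for every reference document with a non-zero score."""
    n_docs = len(corpus)
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for terms in doc_terms.values() for term in terms)
    query_terms = set(tokenize(sentence))
    scores = {}
    for doc_id, terms in doc_terms.items():
        score = sum(
            terms[t] * math.log(1 + n_docs / df[t]) for t in query_terms if t in terms
        )
        if score > 0:
            scores[doc_id] = score
    return scores


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def sentence_similarity(s1, s2):
    """Compare two sentences through the document lists they retrieve."""
    return cosine(retrieve(s1), retrieve(s2))


def filter_novel(sentences, threshold):
    """Keep a sentence only if it is not too similar to any sentence kept so far.
    The fixed threshold is a placeholder for the dynamic one the abstract mentions."""
    kept = []
    for s in sentences:
        if all(sentence_similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    # These two sentences share no words but retrieve overlapping documents.
    print(sentence_similarity("court ruled", "two airlines"))
    print(filter_novel(["court ruled", "two airlines", "vaccine trial"], 0.5))
```

In this sketch, "court ruled" and "two airlines" have no terms in common, yet both retrieve document `d1`, so their retrieved-document vectors overlap; this is the sense in which the reference corpus expands the information in a short sentence. The dynamic threshold described in the abstract is not reproduced here because the abstract does not give its formula.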
Similar resources
Some Similarity Computation Methods in Novelty Detection
In the novelty task, the amount of information in a sentence that can be used in similarity computation is the major challenge. Several information expansion methods were introduced to tackle this problem. Our approach to relevance identification was to expand the information of a sentence with its context using a sliding window method. The similarity was measured b...
Linear-Time Computation of Similarity Measures for Sequential Data
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams...
Finding Topic-specific Strings in Text Categorization and Opinion Mining Contexts
In this paper, we present a new probabilistic method for automatically extracting topic-specific strings in a text categorization context. The advantage of this method is twofold. First, it allows us to automatically point out the expressions characterizing a specific topic category for a potential knowledge modelling. Second, it contributes to improve categorization results by providing to the...
Genetic Algorithm Based Text Categorization Using OLEX Method
This work describes a new similarity-based genetic algorithm (GA) and thresholding strategies (R&SCut variants). The GA was designed to give appropriate weights to terms according to their semantic content and importance by using their co-occurrence information and the discriminating power values for similarity computation. After investigating the existing common thresholding strategies, design mult...
Text Categorization with a Small Number of Labeled Training Examples
This thesis describes the investigation and development of supervised and semisupervised learning approaches to similarity-based text categorization systems. It uses a small number of manually labeled examples for training and still maintains effectiveness. The purpose of text categorization is to automatically assign arbitrary raw documents to predefined categories based on their contents. Tex...